sessionInfo()Biostat 203B Homework 2
Due Feb 9 @ 11:59PM
Display machine information for reproducibility:
Load necessary libraries (you can add more as needed).
library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)
# added library
library(readr)
library(duckdb)
library(dplyr)Display memory information of your computer
memuse::Sys.meminfo()In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.
Display the contents of MIMIC hosp and icu data folders:
ls -l ~/mimic/hosp/ls -l ~/mimic/icu/Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)
Q1.1 Speed, memory, and data types
There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.
Answer:
Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage.)
admissions_base <- read.csv("~/mimic/hosp/admissions.csv.gz")
admissions_base
system.time(admissions_base <- read.csv("~/mimic/hosp/admissions.csv.gz"))
pryr::object_size(admissions_base)admissions_tidy <- read_csv("~/mimic/hosp/admissions.csv.gz")
admissions_tidy
system.time(read_csv("~/mimic/hosp/admissions.csv.gz"))
pryr::object_size(admissions_tidy)admissions_data_table <- fread("~/mimic/hosp/admissions.csv.gz")
admissions_data_table
system.time(admissions_data_table <- fread("~/mimic/hosp/admissions.csv.gz"))
pryr::object_size(admissions_data_table)Therefore, the fastest function is fread in the data.table package.
The default parsed data types are different among the three functions that read.csv in base R and fread in the data.table package parses some numeric columns as integer, while read_csv in tidyverse parses some numeric columns as double. And the time variable is parsed as character in read.csv , while them is parsed as datetime(S3: POSIXct) in read_csv and fread.
The memory usage of the resultant dataframe or tibble is 158.71 MB for read.csv in base R, 55.31 MB for read_csv in tidyverse, and 50.13 MB for fread in data.table.
Q1.2 User-supplied data types
Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)
Answer:
library(readr)
col_types <- cols(
admission_type = col_character(),
admission_location = col_character(),
discharge_location = col_character(),
insurance = col_character(),
language = col_character(),
marital_status = col_character(),
race = col_character(),
subject_id = col_double(),
hadm_id = col_double(),
hospital_expire_flag = col_double(),
admittime = col_datetime(),
dischtime = col_datetime(),
deathtime = col_datetime(),
edregtime = col_datetime(),
edouttime = col_datetime()
)
admissions_tidy <- read_csv("~/mimic/hosp/admissions.csv.gz",
col_types = col_types)
system.time(read_csv("~/mimic/hosp/admissions.csv.gz", col_types = col_types)) user system elapsed
0.829 0.073 0.467
nrow(admissions_tidy)[1] 431231
pryr::object_size(admissions_tidy)55.31 MB
The run time changes from user:0.894 sys:0.104 elapsed:0.999 to user:0.799 sys:0.085 elapsed: 0.443 on my computer. The memory usage of the resultant tibble is 55.31 MB.
Q2. Ingest and filter big data files
Let us focus on a bigger file, labevents.csv.gz, which is about 125x bigger than admissions.csv.gz.
ls -l ~/mimic/hosp/labevents.csv.gzDisplay the first 10 lines of this file.
zcat < ~/mimic/hosp/labevents.csv.gz | head -10Q2.1 Ingest labevents.csv.gz by read_csv
Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 5 minutes on your computer, then abort the program and report your findings.
Answer:
labevents_tidy <- read_csv("~/mimic/hosp/labevents.csv.gz")
system.time(labevents_tidy)
pryr::object_size(labevents_tidy)This code doesn’t finish within 5 minutes on my computer. The reason is that the file is too large to be ingested by read_csv in tidyverse. The read_csv function in tidyverse takes more than 5 minutes to ingest labevents.csv.gz on my computer. Therefore, I abort the program and report my findings. If not aborted, the output is as follows: Error: vector memory exhausted (limit reached?).
Q2.2 Ingest selected columns of labevents.csv.gz by read_csv
Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)
Answer:
system.time({
labevents_tidy <- read_csv("~/mimic/hosp/labevents.csv.gz", col_select =
c(subject_id, itemid, charttime, valuenum))
}) user system elapsed
92.289 93.330 117.804
pryr::object_size(labevents_tidy)3.78 GB
This solution solves the ingestion issue on my computer.
The run time is user: 116.788 sys: 112.927 elapsed: 136.609.
so it takes about 136.609 seconds for read_csv to ingest labevents.csv.gz on my computer.
The memory usage of the resultant tibble is 3.78 GB.
Q2.3 Ingest subset of labevents.csv.gz
Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.
In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. To save render time, put #| eval: false at the beginning of this code chunk.)
Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file? How long does it take read_csv to ingest labevents_filtered.csv.gz?
Answer:
zcat < ~/mimic/hosp/labevents.csv.gz | \
awk -F, 'BEGIN {OFS=","} NR==1 || \
($5==50912 || $5==50971 || $5==50983 || \
$5==50902 || $5==50882 || $5==51221 || \
$5==51301 || $5==50931) {print $2, $5, $7, $10}' | \
gzip > ~/mimic/hosp/labevents_filtered.csv.gz#Display the first 10 lines and the number of lines in this new file
cd ~/mimic/hosp/
zcat < labevents_filtered.csv.gz 2>/dev/null | head -10
zcat < labevents_filtered.csv.gz | wc -lsubject_id,itemid,charttime,valuenum
10000032,50882,2180-03-23 11:51:00,27
10000032,50902,2180-03-23 11:51:00,101
10000032,50912,2180-03-23 11:51:00,0.4
10000032,50971,2180-03-23 11:51:00,3.7
10000032,50983,2180-03-23 11:51:00,136
10000032,50931,2180-03-23 11:51:00,95
10000032,51221,2180-03-23 11:51:00,45.4
10000032,51301,2180-03-23 11:51:00,3
10000032,51221,2180-05-06 22:25:00,42.6
24855910
#How long does it take read_csv to ingest labevents_filtered.csv.gz?
system.time(read_csv("~/mimic/hosp/labevents_filtered.csv.gz")) user system elapsed
10.828 1.249 3.555
It takes about 3.5 seconds for read_csv to ingest labevents_filtered.csv.gz on my computer.
Q2.4 Ingest labevents.csv by Apache Arrow
Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory. To save render time, put #| eval: false at the beginning of this code chunk.
Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)
Answer:
zcat < ~/mimic/hosp/labevents.csv.gz > ~/labevents.csvlibrary(dplyr)
system.time({
labevents <- arrow::open_dataset("~/labevents.csv", format = "csv") %>%
filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
select(subject_id, itemid, charttime, valuenum) %>%
collect()
}) user system elapsed
33.021 2.462 31.137
labevents_filtered <- labevents %>%
arrange(subject_id, charttime, itemid)
num_rows <- nrow(labevents_filtered)
cat("Number of rows:", num_rows, "\n")Number of rows: 24855909
head(labevents_filtered, 10)# A tibble: 10 × 4
subject_id itemid charttime valuenum
<int> <int> <dttm> <dbl>
1 10000032 50882 2180-03-23 04:51:00 27
2 10000032 50902 2180-03-23 04:51:00 101
3 10000032 50912 2180-03-23 04:51:00 0.4
4 10000032 50931 2180-03-23 04:51:00 95
5 10000032 50971 2180-03-23 04:51:00 3.7
6 10000032 50983 2180-03-23 04:51:00 136
7 10000032 51221 2180-03-23 04:51:00 45.4
8 10000032 51301 2180-03-23 04:51:00 3
9 10000032 50882 2180-05-06 15:25:00 27
10 10000032 50902 2180-05-06 15:25:00 105
The ingest+select+filter process takes about 31 seconds on my computer. The number of rows is 24855909.
Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator.
Apache Arrow is like a universal translator for data in the digital world. It helps different computer programs speak the same language when sharing and processing data, making communication faster and more efficient. With Apache Arrow, data can move seamlessly between different systems without losing its structure or meaning, just like people can understand each other even if they speak different languages.
Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter
Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)
# Write the CSV file in the binary Parquet format
labevents <- arrow::open_dataset("~/labevents.csv", format = "csv")
arrow::write_dataset(labevents, "~/labevents.parquet")
# the time of the ingest+select+filter process of the Parquet file(s)
system.time({
labevents_parquet <- arrow::open_dataset("~/labevents.parquet") %>%
dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882,
51221, 51301, 50931)) %>%
dplyr::select(subject_id, itemid, charttime, valuenum) %>%
collect()
}) user system elapsed
8.960 3.012 3.069
# the size of the Parquet file(s)
object.size(labevents_parquet)596543360 bytes
cat ("The size of the Parquet file(s) is", object.size(labevents_parquet), "\n")The size of the Parquet file(s) is 596543360
labevents_filter_parquet <- labevents_parquet %>%
arrange(subject_id, charttime, itemid)
# display the number of rows and the first 10 rows of the result tibble
num_rows <- nrow(labevents_filter_parquet)
cat("Number of rows:", num_rows, "\n")Number of rows: 24855909
head(labevents_filter_parquet, 10)# A tibble: 10 × 4
subject_id itemid charttime valuenum
<int> <int> <dttm> <dbl>
1 10000032 50882 2180-03-23 04:51:00 27
2 10000032 50902 2180-03-23 04:51:00 101
3 10000032 50912 2180-03-23 04:51:00 0.4
4 10000032 50931 2180-03-23 04:51:00 95
5 10000032 50971 2180-03-23 04:51:00 3.7
6 10000032 50983 2180-03-23 04:51:00 136
7 10000032 51221 2180-03-23 04:51:00 45.4
8 10000032 51301 2180-03-23 04:51:00 3
9 10000032 50882 2180-05-06 15:25:00 27
10 10000032 50902 2180-05-06 15:25:00 105
Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator.
Parquet is like a digital storage format that arranges data in a very organized and compact manner, making it easy for computers to access and work with. It’s like a streamlined filing system for data, ensuring that information is stored efficiently and can be retrieved quickly when needed.
Q2.6 DuckDB
Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)
labevents_parquet <- arrow::open_dataset("~/labevents.parquet")
system.time({
labevents_duckdb <- labevents_parquet %>%
arrow::to_duckdb(table_name = "labevents_table") %>%
dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301,
50931)) %>%
dplyr::select(subject_id, itemid, charttime, valuenum) %>%
collect()
})
labevents_filter_duckdb <- labevents_duckdb %>%
arrange(subject_id, charttime, itemid)
num_rows <- nrow(labevents_filter_duckdb)
cat("Number of rows:", num_rows, "\n")
head(labevents_filter_duckdb, 10)Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator.
DuckDB is like a super-fast calculator for data. It’s a special tool that helps computers crunch numbers and analyze information really quickly. Just as a calculator helps you do math problems faster, DuckDB helps programs process and understand large amounts of data with lightning speed.
Q3. Ingest and filter chartevents.csv.gz
chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are
zcat < ~/mimic/icu/chartevents.csv.gz | head -10d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.
zcat < ~/mimic/icu/d_items.csv.gz | head -10In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.
Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.
Step 1: Decompress chartevents.csv.gz to chartevents.csv
# decompress chartevents.csv.gz to chartevents.csv
zcat < ~/mimic/icu/chartevents.csv.gz > ~/chartevents.csvStep 2: Ingest and filter chartevents.csv and display the number of rows of the result
# choose Parquet format
chartevents <- arrow::open_dataset("~/chartevents.csv", format = "csv")
arrow::write_dataset(chartevents, "~/chartevents.parquet")
# display the time of the process
system.time({
chartevents_parquet <- arrow::open_dataset("~/chartevents.parquet") %>%
dplyr::filter(itemid %in% c(220045, 220181, 220179, 223761, 220210)) %>%
dplyr::select(subject_id, itemid, charttime, valuenum) %>%
collect()
}) user system elapsed
18.765 5.333 4.830
chartevents_filtered_parquet <- chartevents_parquet %>%
arrange(subject_id, charttime, itemid)
# Step 3: display the number of rows and the first 10 rows of the result tibble
num_rows <- nrow(chartevents_filtered_parquet)
cat("Number of rows:", num_rows, "\n")Number of rows: 22502319
head(chartevents_filtered_parquet, 10)# A tibble: 10 × 4
subject_id itemid charttime valuenum
<int> <int> <dttm> <dbl>
1 10000032 223761 2180-07-23 07:00:00 98.7
2 10000032 220179 2180-07-23 07:11:00 84
3 10000032 220181 2180-07-23 07:11:00 56
4 10000032 220045 2180-07-23 07:12:00 91
5 10000032 220210 2180-07-23 07:12:00 24
6 10000032 220045 2180-07-23 07:30:00 93
7 10000032 220179 2180-07-23 07:30:00 95
8 10000032 220181 2180-07-23 07:30:00 67
9 10000032 220210 2180-07-23 07:30:00 21
10 10000032 220045 2180-07-23 08:00:00 94